
    Soft-Label Dataset Distillation and Text Dataset Distillation

    Dataset distillation is a method for reducing dataset sizes by learning a small number of synthetic samples containing all the information of a large dataset. This has several benefits, such as speeding up model training, reducing energy consumption, and reducing required storage space. Currently, each synthetic sample is assigned a single "hard" label, and dataset distillation can only be applied to image data. We propose to simultaneously distill both images and their labels, thus assigning each synthetic sample a "soft" label (a distribution over classes). Our algorithm increases accuracy by 2-4% over the original algorithm on several image classification tasks. Using soft labels also enables distilled datasets to consist of fewer samples than there are classes, because each sample can encode information about multiple classes. For example, training a LeNet model with 10 distilled images (one per class) results in over 96% accuracy on MNIST, and almost 92% accuracy when trained on just 5 distilled images. We also extend the dataset distillation algorithm to distill sequential datasets, including text. We demonstrate that text distillation outperforms other methods across multiple datasets. For example, models attain almost their original accuracy on the IMDB sentiment analysis task using just 20 distilled sentences. Our code can be found at https://github.com/ilia10000/dataset-distillation.
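
    The core idea, treating both the synthetic images and their label distributions as learnable parameters and optimizing them through the training of a small model on the distilled data, can be sketched roughly as follows. This is a minimal PyTorch-style illustration under assumed shapes and hyperparameters, not the authors' implementation; the linear inner model, the single inner step, and the dummy data loader are placeholders.

    # Minimal sketch of soft-label dataset distillation (illustrative, not the authors' code).
    # Assumptions: MNIST-sized inputs, a linear classifier as the inner model, and a single
    # differentiable inner SGD step per outer update.
    import torch
    import torch.nn.functional as F

    n_distilled, n_classes, dim = 10, 10, 28 * 28
    syn_x = torch.randn(n_distilled, dim, requires_grad=True)               # learnable synthetic images
    syn_y_logits = torch.zeros(n_distilled, n_classes, requires_grad=True)  # learnable soft labels
    inner_lr = 0.1
    outer_opt = torch.optim.Adam([syn_x, syn_y_logits], lr=1e-2)

    # Dummy stand-in for a real MNIST loader.
    real_loader = [(torch.randn(64, dim), torch.randint(0, n_classes, (64,))) for _ in range(100)]

    for real_x, real_y in real_loader:
        # Fresh inner model for each outer step.
        w = torch.zeros(dim, n_classes, requires_grad=True)
        b = torch.zeros(n_classes, requires_grad=True)

        # One differentiable inner step: train the model on the distilled images and soft labels.
        y_soft = F.softmax(syn_y_logits, dim=1)
        logits = syn_x @ w + b
        loss_inner = -(y_soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        gw, gb = torch.autograd.grad(loss_inner, (w, b), create_graph=True)
        w1, b1 = w - inner_lr * gw, b - inner_lr * gb

        # Outer step: evaluate the one-step-trained model on real data; the gradient flows
        # back through the inner update into the synthetic images and soft labels.
        loss_outer = F.cross_entropy(real_x @ w1 + b1, real_y)
        outer_opt.zero_grad()
        loss_outer.backward()
        outer_opt.step()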

    On the Equivalence of Common Approaches to Cross Sectional Weights in Household Panel Surveys

    The computation of cross-sectional weights in household panels is challenging because household compositions change over time. Sampling probabilities of new household entrants are generally not known, and assigning them zero weight is not satisfactory. Two common approaches to cross-sectional weighting address this issue: (1) "shared weights" and (2) modeling or estimating unobserved sampling probabilities based on person-level characteristics. We survey how several well-known national household panels address cross-sectional weights for different groups of respondents (including immigrants and births) and in different situations (including household mergers and splits). We show that for certain estimated sampling probabilities the modeling approach gives the same weights as fair shares, the most common of the shared-weights approaches. Rather than abandoning the shared-weights approach when orphan respondents (respondents in households without sampling weights) exist, we propose a hybrid approach: estimating sampling weights of newly orphan respondents only.

    Keywords: BHPS, HILDA, PSID, SOEP, modeled weights, shared weights, fair shares
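
    For intuition, the fair-shares (weight-share) idea can be illustrated with a toy computation: each current household member receives the average of the base design weights of the household's members, where new entrants with unknown sampling probability contribute a base weight of zero. This is a simplified sketch of the general approach, not any specific panel's weighting procedure.

    # Toy illustration of the fair-shares / weight-share idea (simplified; not any
    # particular panel's procedure). New entrants have no design weight, so they
    # contribute 0 to the household total but still count in the household size.
    def fair_shares_weight(base_weights):
        """base_weights: design weights of all current household members,
        with None for new entrants whose sampling probability is unknown."""
        total = sum(w for w in base_weights if w is not None)
        return total / len(base_weights)

    # A wave-1 respondent (weight 1200) now lives with a new partner (no weight):
    # both receive a cross-sectional weight of 600.
    print(fair_shares_weight([1200, None]))  # 600.0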

    Nearest Labelset Using Double Distances for Multi-label Classification

    Multi-label classification is a type of supervised learning where an instance may belong to multiple labels simultaneously. Predicting each label independently has been criticized for not exploiting any correlation between labels. In this paper we propose a novel approach, Nearest Labelset using Double Distances (NLDD), which predicts, for a new instance, the labelset observed in the training data that minimizes a weighted sum of the distances to the instance in the feature space and in the label space. The weights specify the relative tradeoff between the two distances and are estimated from a binomial regression of the number of misclassified labels as a function of the two distances; model parameters are estimated by maximum likelihood. NLDD only considers labelsets observed in the training data, thus implicitly taking label dependencies into account. Experiments on benchmark multi-label data sets show that the proposed method on average outperforms other well-known approaches in terms of Hamming loss, 0/1 loss, and multi-label accuracy, and ranks second after ECC on the F-measure.
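
    A rough sketch of the prediction step, scoring training instances by a weighted sum of their feature-space and label-space distances to the new instance and returning the labelset of the minimizer, might look like the following. This is illustrative only: it assumes the label-space position of a new instance is given by predicted marginal label probabilities, and the weights w_x and w_y stand in for the values the paper estimates via binomial regression and maximum likelihood.

    # Illustrative nearest-labelset rule in the spirit of NLDD (not the authors' implementation).
    import numpy as np

    def predict_nearest_labelset(x_new, p_new, X_train, Y_train, w_x=1.0, w_y=1.0):
        """x_new: feature vector of the new instance; p_new: its predicted marginal
        label probabilities; X_train: (n, d) training features; Y_train: (n, q) binary labels."""
        d_feature = np.linalg.norm(X_train - x_new, axis=1)    # distance in feature space
        d_label = np.linalg.norm(Y_train - p_new, axis=1)      # distance in label space
        scores = w_x * d_feature + w_y * d_label               # weighted sum of the two distances
        return Y_train[np.argmin(scores)]                      # labelset observed in training data

    # Example with three training instances and four labels.
    X_train = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
    Y_train = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]])
    print(predict_nearest_labelset(np.array([0.1, 0.9]),
                                   np.array([0.8, 0.2, 0.1, 0.7]),
                                   X_train, Y_train))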

    What do web survey panel respondents answer when asked “Do you have any other comment?”

    Near the end of a web survey, respondents are often asked whether they have additional comments. Such final comments are usually ignored, partly because open-ended questions are more challenging to analyze. A random sample of final comments in the LISS panel and the Dutch Immigrant panel was categorized into one of nine categories (neutral, positive, and several subcategories of negative). While few respondents chose to make a final comment, doing so is more common in the Immigrant panel (5.7%) than in the LISS panel (3.6%). In both panels there are slightly more neutral than negative comments, and very few positive comments. The number of final comments about unclear questions was 2.7 times larger in the Immigrant panel than in the LISS panel; the number of final comments complaining about survey length, on the other hand, was 2.7 times larger in the LISS panel than in the Immigrant panel. Researchers might want to consider additional pretesting of questions when fielding a questionnaire in the Immigrant panel.

    Household Survey Panels: How Much Do Following Rules Affect Sample Size?

    In household panels, typically all household members are surveyed. Because household composition changes over time, so-called following rules are implemented to decide whether to continue surveying household members who leave the household (e.g. former spouses/partners, grown children) in subsequent waves. Following rules have been largely ignored in the literature, leaving panel designers unaware of the breadth of their options and forcing them to make ad hoc decisions. In particular, to what extent various following rules affect sample size over time is unknown. From an operational point of view such knowledge is important because sample size greatly affects costs. Moreover, the decision of whom to follow has irreversible consequences, as finding household members who moved out years earlier is very difficult. We find that household survey panels implement a wide variety of following rules but that their effect on sample size is relatively limited. Even after 25 years, the rule "follow only wave 1 respondents" still captures 85% of the respondents of the rule "follow everyone who can be traced back to a wave 1 household through living arrangements". Almost all of the remaining 15% live in households of children of wave 1 respondents who have grown up (5%) or in households of former spouses/partners (10%). Unless attrition is low, there is no danger of an ever-expanding panel, because even wide following rules do not typically exceed attrition.

    Keywords: survey panels, survey methodology

    Coding Text Answers to Open-ended Questions: Human Coders and Statistical Learning Algorithms Make Similar Mistakes

    Text answers to open-ended questions are often manually coded into one of several predefined categories or classes. More recently, researchers have begun to employ statistical models to automatically classify such text responses. It is unclear whether such automated coders and human coders find the same type of observations difficult to code, or whether humans and models might be able to compensate for each other's weaknesses. We analyze correlations between estimated error probabilities of human and automated coders and find: (1) statistical models have higher error rates than human coders; (2) automated coders (models) and human coders tend to make similar coding mistakes; specifically, the correlation between the estimated coding error of a statistical model and that of a human is comparable to the correlation between two humans; and (3) two very different statistical models give highly correlated estimated coding errors. Therefore, (a) the choice of statistical model does not matter, and (b) having a second automated coder would be redundant.
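
    The kind of correlation analysis described above can be pictured with a toy example; the assumptions here (an automated coder's per-answer error probability taken as one minus the predicted probability of its chosen class, a human coder's error probability estimated from disagreement with a reference coding) are illustrative and not necessarily the paper's estimators.

    # Toy sketch of correlating estimated error probabilities of two coders (illustrative).
    import numpy as np

    model_error = np.array([0.05, 0.40, 0.10, 0.55, 0.20])  # e.g. 1 - max predicted class probability
    human_error = np.array([0.00, 0.50, 0.00, 0.50, 0.25])  # e.g. estimated from disagreements

    r = np.corrcoef(model_error, human_error)[0, 1]
    print(f"correlation between the coders' estimated error probabilities: {r:.2f}")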

    Automated classification for open-ended questions with BERT

    Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually coded text answers. Recently, pre-training a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than other, non-pre-trained statistical learning approaches. First, we find that fine-tuning the pre-trained BERT parameters is essential, as otherwise BERT is not competitive. Second, we find that fine-tuned BERT barely beats the non-pre-trained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT's relative advantage increases rapidly when more manually coded observations (e.g. 200-400) are available for training. We conclude that for automatically coding answers to open-ended questions BERT is preferable to non-pre-trained models such as support vector machines and boosting.
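
    A minimal fine-tuning setup of the kind described, adapting a pre-trained BERT model to classify open-ended answers from a few hundred manually coded examples, could look roughly like this with the Hugging Face transformers library. It is a generic sketch, not the paper's configuration; the texts, codes, and hyperparameters below are placeholders.

    # Generic BERT fine-tuning sketch with Hugging Face transformers (not the paper's exact setup).
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    texts = ["the questionnaire was too long", "no further comments"]  # open-ended answers (placeholders)
    codes = [1, 0]                                                     # manually assigned codes
    num_classes = 2

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                               num_labels=num_classes)

    class CodedAnswers(torch.utils.data.Dataset):
        """Tokenized open-ended answers paired with their manually assigned codes."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-open-ended",
                               num_train_epochs=3,
                               per_device_train_batch_size=8),
        train_dataset=CodedAnswers(texts, codes),
    )
    trainer.train()  # fine-tunes all BERT parameters, which the abstract notes is essential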

    ConvART: Improving Adaptive Resonance Theory for Unsupervised Image Clustering

    While supervised learning techniques have become increasingly adept at separating images into different classes, these techniques require large amounts of labelled data, which may not always be available. We propose a novel neuro-dynamic method for unsupervised image clustering by combining two biologically motivated models: Adaptive Resonance Theory (ART) and Convolutional Neural Networks (CNN). ART networks are unsupervised clustering algorithms that have high stability in preserving learned information while quickly learning new information. Meanwhile, a major property of CNNs is their translation and distortion invariance, which has led to their success in the domain of vision problems. By embedding convolutional layers into an ART network, the useful properties of both networks can be leveraged to identify different clusters within unlabelled image datasets and classify images into these clusters. In exploratory experiments, we demonstrate that this method greatly increases the performance of unsupervised ART networks on a benchmark image dataset.
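
    One way to picture the combination, convolutional features feeding an ART-style clustering layer, is sketched below. This is a generic Fuzzy-ART-on-CNN-features illustration under assumed hyperparameters (vigilance rho, choice parameter alpha, learning rate beta), not the authors' ConvART architecture; the untrained convolutional extractor and the random images are placeholders.

    # Generic illustration of ART-style clustering on convolutional features
    # (assumed hyperparameters; not the authors' ConvART architecture).
    import numpy as np
    import torch
    import torch.nn as nn

    # Placeholder convolutional feature extractor (left untrained here for brevity).
    conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(4), nn.Flatten())

    def fuzzy_art(features, rho=0.75, alpha=0.001, beta=1.0):
        """Fuzzy-ART clustering of feature vectors; returns a cluster index per input."""
        lo, hi = features.min(0), features.max(0)
        x = (features - lo) / (hi - lo + 1e-8)        # scale each feature to [0, 1]
        x = np.hstack([x, 1.0 - x])                   # complement coding
        weights, assignments = [], []
        for xi in x:
            # Choice function ranks existing categories; the vigilance test accepts or rejects.
            scores = [np.minimum(xi, w).sum() / (alpha + w.sum()) for w in weights]
            for j in np.argsort(scores)[::-1]:
                if np.minimum(xi, weights[j]).sum() / xi.sum() >= rho:   # resonance: update category
                    weights[j] = beta * np.minimum(xi, weights[j]) + (1 - beta) * weights[j]
                    assignments.append(int(j))
                    break
            else:                                     # no category matches: create a new one
                weights.append(xi.copy())
                assignments.append(len(weights) - 1)
        return assignments

    images = torch.randn(16, 1, 28, 28)               # stand-in for an unlabelled image set
    with torch.no_grad():
        feats = conv(images).numpy()
    print(fuzzy_art(feats))                           # cluster index for each image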